First, load the libraries:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(forcats))
library(gapminder)
library(kableExtra)
library(knitr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Part 1: Factor management

Elaboration for the gapminder data set

1. Drop Oceania

First, filter the Gapminder data to remove observations associated with the continent of Oceania.

noOceania <- gapminder %>% 
  filter(continent != "Oceania")

Then remove unused factor levels.

dropOceania <- noOceania %>% 
  droplevels()

Let’s compare the number of rows and levels in each data set:

tibble1 <- tibble(name = c("gapminder", "noOceania", "dropOceania"), num_of_level = c(nlevels(gapminder$continent),nlevels(noOceania$continent),nlevels(dropOceania$continent)), num_of_row = c(nrow(gapminder), nrow(noOceania), nrow(dropOceania)))
knitr::kable(tibble1) %>% 
  kable_styling(bootstrap_options = "bordered",latex_options = "basic",full_width = F)
name         num_of_level  num_of_row
gapminder               5        1704
noOceania               5        1680
dropOceania             4        1680

We can see from the table above that after filtering Oceania out of gapminder, the number of rows falls from 1704 to 1680, but continent still has 5 levels. After dropping the unused level, the number of levels changes to 4, while the number of rows stays at 1680.

Then let’s check the levels of each data set:

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(noOceania$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(dropOceania$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"

So with droplevels( ) the Oceania level is dropped.
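Note that droplevels( ) acts on every factor column in the data frame, so it also drops the unused levels of country. As a minimal sketch, assuming we only want to touch continent, forcats::fct_drop( ) can be applied to that single column. The names dropContinentOnly and dropAll are made up for this illustration:

```r
library(dplyr)
library(forcats)
library(gapminder)

# Drop unused levels in continent only; country keeps all of its levels.
dropContinentOnly <- gapminder %>%
  filter(continent != "Oceania") %>%
  mutate(continent = fct_drop(continent))

# droplevels() on the whole data frame, for comparison.
dropAll <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

nlevels(dropContinentOnly$continent)  # Oceania is gone
nlevels(dropContinentOnly$country)    # same as in the original gapminder
nlevels(dropAll$country)              # fewer: the Oceania countries are dropped too
```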

2. Reorder the levels of country or continent

2.1 Reorder with fct_reorder()

Let’s order the continent factor by the largest lifeExp, in descending order:

reOrder <- gapminder %>% 
  group_by(continent) %>% 
  summarise(maxLifeExp = max(lifeExp)) %>% 
  mutate(continent = fct_reorder(continent, maxLifeExp, max, .desc = TRUE))
levels(reOrder$continent)
## [1] "Asia"     "Europe"   "Oceania"  "Americas" "Africa"

Let’s check the order by using arrange( ):

arran <- gapminder %>% 
  group_by(continent) %>% 
  summarise(largest_lifeExp = max(lifeExp)) %>% 
  arrange(desc(largest_lifeExp))
knitr::kable(arran) %>% 
  kable_styling(bootstrap_options = "bordered",latex_options = "basic",full_width = F)
continent  largest_lifeExp
Asia                82.603
Europe              81.757
Oceania             81.235
Americas            80.653
Africa              76.442

We can see it’s the same order as the result above.

2.2 Compare arrange() and fct_reorder()

Let’s see the order of the levels in reOrder and arran:

levels(reOrder$continent)  #order of factor after using fct_reorder()
## [1] "Asia"     "Europe"   "Oceania"  "Americas" "Africa"
levels(arran$continent) #order of factor after using arrange()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

In reOrder, fct_reorder( ) changes the order of the levels to the expected one: descending max lifeExp. In arran, arrange( ) sorts the rows but leaves the order of the levels unchanged.

2.3 The effects of fct_reorder() and arrange() on the figure

We have seen the effect of fct_reorder( ) and arrange( ) on the order of the levels; now let’s see the effect on a figure. First, plot the max lifeExp of each Asian country with fct_reorder( ):

gapAsia <- gapminder %>% 
  filter(continent == "Asia") %>% 
  group_by(country) %>% 
  summarise(maxLifeExp = max(lifeExp))
gapAsia %>% 
  ggplot(aes(maxLifeExp, fct_reorder(country, maxLifeExp))) + geom_point(aes(color = country)) + xlab("Max LifeExp") + ylab("country") + ggtitle("Max LifeExp in Asian Countries")

We can see that the countries in the figure are now ordered by max lifeExp. Then let’s use arrange( ):

gapAsia %>% 
  arrange(maxLifeExp) %>% 
  ggplot(aes(maxLifeExp, country)) + geom_point(aes(color = country)) + xlab("Max LifeExp") + ylab("country") + ggtitle("Max LifeExp in Asian Countries")

This time the order of the countries doesn’t change, because arrange( ) only sorts the rows and can’t change the order of the levels in country, which is what ggplot uses for the axis. Then let’s use arrange( ) and fct_reorder( ) together:

gapAsia %>% 
  arrange(maxLifeExp) %>% 
  ggplot(aes(maxLifeExp, fct_reorder(country, maxLifeExp))) + geom_point(aes(color = country)) + xlab("Max LifeExp") + ylab("country") + ggtitle("Max LifeExp in Asian Countries")

The order of the countries has changed. So if we use fct_reorder( ), alone or combined with arrange( ), the order of the levels changes in the figure. However, if we only use arrange( ), the order of the levels in the figure does not change.
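If we do want the row order produced by arrange( ) to carry over to a figure, one option is to turn the row order into the level order with forcats::fct_inorder( ). This is only a sketch of that idea, not part of the original analysis; gapAsiaOrdered is a name made up for the illustration:

```r
library(dplyr)
library(forcats)
library(gapminder)

gapAsiaOrdered <- gapminder %>%
  filter(continent == "Asia") %>%
  group_by(country) %>%
  summarise(maxLifeExp = max(lifeExp)) %>%
  arrange(maxLifeExp) %>%
  # fct_drop() removes the non-Asian levels, then fct_inorder() sets the
  # levels to the order in which the countries appear in the sorted rows.
  mutate(country = fct_inorder(fct_drop(country)))

head(levels(gapAsiaOrdered$country))
```

Plotting gapAsiaOrdered with aes(maxLifeExp, country) should then give the sorted axis without calling fct_reorder( ) inside ggplot( ).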

Part 2: File I/O

First, create a new data frame with the Asian countries whose max lifeExp is above 75 years:

df <- gapminder %>% 
  filter(continent == "Asia" & lifeExp > 75) %>% 
  group_by(country) %>% 
  summarise(maxLifeExp = max(lifeExp))
knitr::kable(df) %>% 
  kable_styling(bootstrap_options = "bordered",latex_options = "basic",full_width = F)
country           maxLifeExp
Bahrain               75.635
Hong Kong, China      82.208
Israel                80.745
Japan                 82.603
Korea, Rep.           78.623
Kuwait                77.588
Oman                  75.640
Singapore             79.972
Taiwan                78.400

Then, write the data frame to a file and read it back:

1. write_csv()/read_csv()

write_csv(df,"df.csv")
readDf <- read_csv("df.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   maxLifeExp = col_double()
## )
readDf
## # A tibble: 9 x 2
##   country          maxLifeExp
##   <chr>                 <dbl>
## 1 Bahrain                75.6
## 2 Hong Kong, China       82.2
## 3 Israel                 80.7
## 4 Japan                  82.6
## 5 Korea, Rep.            78.6
## 6 Kuwait                 77.6
## 7 Oman                   75.6
## 8 Singapore              80.0
## 9 Taiwan                 78.4

We can see that after write_csv( )/read_csv( ), country changes from a factor to a character column.

2. saveRDS()/readRDS()
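A CSV file stores only text, so the factor information is lost on the way out. As a hedged sketch, the column can be parsed as a factor again by giving read_csv( ) an explicit column specification with col_factor( ). The names dfFactor, csvPath and readAsFactor are made up here, and a temporary file is used instead of df.csv:

```r
library(dplyr)
library(readr)
library(gapminder)

# Rebuild the data frame from Part 2 so this example is self-contained.
dfFactor <- gapminder %>%
  filter(continent == "Asia", lifeExp > 75) %>%
  group_by(country) %>%
  summarise(maxLifeExp = max(lifeExp))

csvPath <- tempfile(fileext = ".csv")
write_csv(dfFactor, csvPath)

# col_factor() with no levels given takes the levels from the file,
# in order of appearance.
readAsFactor <- read_csv(
  csvPath,
  col_types = cols(country = col_factor(), maxLifeExp = col_double())
)

class(readAsFactor$country)
nlevels(readAsFactor$country)
```

The levels recovered this way are only the values present in the file, so they are not guaranteed to match the levels of the original factor.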

saveRDS( ) saves a single R object to a file, and readRDS( ) restores it:

saveRDS(df,"df.rds")
readRds <- readRDS("df.rds")
readRds
## # A tibble: 9 x 2
##   country          maxLifeExp
##   <fct>                 <dbl>
## 1 Bahrain                75.6
## 2 Hong Kong, China       82.2
## 3 Israel                 80.7
## 4 Japan                  82.6
## 5 Korea, Rep.            78.6
## 6 Kuwait                 77.6
## 7 Oman                   75.6
## 8 Singapore              80.0
## 9 Taiwan                 78.4

We can see that after saveRDS( )/readRDS( ), country is still a factor.

3. dput()/dget()

dput( ) writes an ASCII text representation of an R object to a file or connection, and dget( ) uses one to recreate the object.

dput(df,"df.R")
readDput <- dget("df.R")
readDput
## # A tibble: 9 x 2
##   country          maxLifeExp
##   <fct>                 <dbl>
## 1 Bahrain                75.6
## 2 Hong Kong, China       82.2
## 3 Israel                 80.7
## 4 Japan                  82.6
## 5 Korea, Rep.            78.6
## 6 Kuwait                 77.6
## 7 Oman                   75.6
## 8 Singapore              80.0
## 9 Taiwan                 78.4

So after using dput( )/dget( ), country is still a factor.
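The three round trips can also be compared side by side. The following is a small sketch that writes to temporary files rather than the files used above; the names dfCheck and countryClass are made up, and col_types = cols( ) is used only to silence the column specification message:

```r
library(dplyr)
library(readr)
library(gapminder)

dfCheck <- gapminder %>%
  filter(continent == "Asia", lifeExp > 75) %>%
  group_by(country) %>%
  summarise(maxLifeExp = max(lifeExp))

csvPath <- tempfile(fileext = ".csv")
rdsPath <- tempfile(fileext = ".rds")
rPath   <- tempfile(fileext = ".R")

write_csv(dfCheck, csvPath)
saveRDS(dfCheck, rdsPath)
dput(dfCheck, rPath)

# Class of the country column after each round trip.
countryClass <- c(
  csv  = class(read_csv(csvPath, col_types = cols())$country),
  rds  = class(readRDS(rdsPath)$country),
  dput = class(dget(rPath)$country)
)
countryClass
```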

Part 3: Visualization design

1. Remake the figure

Let’s first look at a previous plot, which shows histograms of lifeExp for each continent:

ggplot(gapminder, aes(lifeExp)) + facet_wrap( ~ continent, scales = "free_x") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The first thing to improve is that the histogram only gives the distribution of lifeExp within each continent and hides how it changes over time. So let’s switch to a point plot that shows the trend of lifeExp from the 1950s to the 1990s:

# filter the data
newGap <- gapminder %>% 
  filter(year >= 1950 & year <= 1999)
# make y scale free, change the color to a colour-blind friendly scheme, change the breaks
(point <- newGap %>% 
  ggplot(aes(year, lifeExp)) + facet_wrap( ~ continent, scales = "free_y") + geom_point(aes(color = lifeExp),alpha = 0.3) + labs(title = "lifeExp for 5 continent from 1950s~1990s") +  scale_color_viridis_c(trans="log10",breaks  = 10*(1:8)))

Change the theme:

(point_new <- point + theme_minimal())

Differences: the new plot shows a rough trend and the distribution of lifeExp in each continent. The color also follows lifeExp, which makes it easier to see which continent has a higher lifeExp even though each facet has its own y scale. To keep the plot simple, the theme is changed with theme_minimal( ).

But if we want a clearer view of the distribution in each year, we can use a box plot:

(box_plot <- newGap %>% 
  ggplot(aes(year, lifeExp)) + facet_wrap( ~ continent, scales = "free_y") + geom_boxplot(fill = "blue",color = "orange",outlier.color = "blue", alpha = 0.3, aes(group = year)) + theme_minimal() + labs(title = "lifeExp for 5 continent from 1950s~1990s"))

Differences: the box plot clearly shows the median, the first and third quartiles, the whiskers and any outliers for each year, as well as the trend over the years. However, we still can’t read exact values off the plot, so let’s convert it to plotly.

2. Convert to plotly

For the first point plot, convert the ggplot object to plotly with ggplotly( ):
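The exact numbers behind each box can also be computed directly. This is a minimal sketch, assuming the same 1950 to 1999 filter as newGap; lifeExpSummary and its column names are made up for the illustration. Note that the whiskers drawn by geom_boxplot( ) stop at 1.5 times the interquartile range, so they do not always reach the true minimum and maximum computed here:

```r
library(dplyr)
library(gapminder)

lifeExpSummary <- gapminder %>%
  filter(year >= 1950 & year <= 1999) %>%
  group_by(continent, year) %>%
  summarise(
    min    = min(lifeExp),
    q1     = quantile(lifeExp, 0.25, names = FALSE),
    median = median(lifeExp),
    q3     = quantile(lifeExp, 0.75, names = FALSE),
    max    = max(lifeExp)
  ) %>%
  ungroup()

head(lifeExpSummary)
```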

ggplotly(point_new)

For the second plot, the box plot, do the same with ggplotly( ):

ggplotly(box_plot)

Unlike a static ggplot figure, plotly produces interactive, publication-quality graphs. Readers can interact with the plot in various ways through the tool bar above the plot. The data value is also shown in a tooltip when the pointer moves over a data point.

Then, let’s try plot_ly( ) to make a 3D plot:

newGap %>% 
  plot_ly(x = ~year,
          y = ~continent,
          z = ~lifeExp,
          type = "scatter3d",
          mode = "markers",
          marker = list(size = 3.5, color = ~lifeExp, colorscale = 'Viridis'),
          opacity = 0.3)

In the 3D plot, we can not only read the actual value of each data point, but also combine the data for all 5 continents into one plot. By rotating the view, we can also look at the data for a single continent.

Part 4: Writing figures to file

Use ggsave( ) to explicitly save a plot to a file. By default, ggsave( ) saves the last plot that was displayed. So let’s first plot a graph:

(save_plot <- gapminder %>% 
  ggplot(aes(continent, gdpPercap)) + scale_y_log10() + geom_boxplot(aes(fill = continent),alpha = 0.5))

ggsave( ) guesses the type of graphics device from the file extension, so the only argument you need to supply is the filename. To play around with the various options in ggsave( ), I will use the .png format:

ggsave("save_plot_1.png")
## Saving 7 x 5 in image

Then try changing the width and height of the saved image:

ggsave("width8_height6.png", width = 8, height = 6)

Change the resolution of the saved image:

ggsave("dpi_72.png", dpi = 72)
## Saving 7 x 5 in image

Change the scale of the image:

ggsave("scale_0.6.png", scale = 0.6)
## Saving 4.2 x 3 in image

Try writing the image to a vector format, PDF:

ggsave("vector_image.pdf")
## Saving 7 x 5 in image

By default, ggsave( ) saves the last plot that was displayed. To save an earlier plot, we need to pass it explicitly through the plot argument. For example, to save box_plot:

ggsave("box_plot.png", plot = box_plot)
## Saving 7 x 5 in image

After passing the plot object to ggsave( ), we can save the image we want.
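The options above can also be combined in one call. As a sketch, assuming we want a fixed size in centimetres, the plot below is passed explicitly and written to the session’s temporary directory; the object p, the file name and the chosen size are made up for the illustration:

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(continent, gdpPercap)) +
  scale_y_log10() +
  geom_boxplot(aes(fill = continent), alpha = 0.5)

# Write to a temporary directory so the working directory is left untouched.
outPath <- file.path(tempdir(), "gdp_by_continent.png")

# Passing plot = p means the result does not depend on which plot
# happened to be displayed last.
ggsave(outPath, plot = p, width = 16, height = 10, units = "cm", dpi = 150)

file.exists(outPath)
```

Passing the plot explicitly is the safer habit in scripts, where the most recently displayed plot may not be the one intended.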